AITopics | learning image

learning image

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Neural Information Processing Systems, Dec-24-2025, 04:53:30 GMT

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in the image regions.
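The abstract describes attention between image patches and document words but gives no implementation details. As a rough illustration only, the sketch below uses plain scaled dot-product attention with patches as queries and words as both keys and values. The function names (`cross_attention`, `image_document_score`), the absence of learned projections, and the mean-pooled score are all assumptions made for this example, not the I2DFormer module itself.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(patches, words):
    """Attend from each image patch (query) over document words (keys/values).

    patches: list of patch embeddings, each a list of floats of length d
    words:   list of word embeddings, each a list of floats of length d
    Returns (attended, weights): one attended vector and one weight row per patch.
    Hypothetical helper: no learned projections, unlike a real transformer layer.
    """
    dim = len(words[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for p in patches:
        w = softmax([dot(p, t) / scale for t in words])
        weights.append(w)
        attended.append(
            [sum(wi * t[d] for wi, t in zip(w, words)) for d in range(dim)]
        )
    return attended, weights


def image_document_score(patches, words):
    """Assumed scoring rule: mean similarity between each patch and its attended vector."""
    attended, _ = cross_attention(patches, words)
    return sum(dot(p, a) for p, a in zip(patches, attended)) / len(patches)
```

Under these assumptions, zero-shot prediction would amount to computing `image_document_score` against each class document and picking the highest-scoring class, and each row of `weights` indicates which words a given patch attends to most.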

document attention, i2dformer, learning image, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Supplementary Material I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Neural Information Processing Systems, Aug-14-2025, 19:17:54 GMT

Our novel I2DFormer, which only utilizes the I2DAttention module, outperforms all these baselines in row d). Our model is designed with the problem constraints of our ZSL setting and the resulting information asymmetry in mind.

i2dattention module, i2dformer, information, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.05)
South America > Brazil (0.04)
Oceania > New Zealand (0.04)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.65)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.42)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.40)

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Neural Information Processing Systems, Oct-11-2024, 00:55:15 GMT

document attention, i2dformer, zero-shot image classification, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Learning Images Across Scales Using Adversarial Training

Wolski, Krzysztof, Djeacoumar, Adarsh, Javanmardi, Alireza, Seidel, Hans-Peter, Theobalt, Christian, Cordonnier, Guillaume, Myszkowski, Karol, Drettakis, George, Pan, Xingang, Leimkühler, Thomas

arXiv.org Artificial Intelligence, Jun-13-2024

The real world exhibits rich structure and detail across many scales of observation. It is difficult, however, to capture and represent a broad spectrum of scales using ordinary images. We devise a novel paradigm for learning a representation that captures scales spanning orders of magnitude from an unstructured collection of ordinary images. We treat this collection as a distribution of scale-space slices to be learned using adversarial training, and additionally enforce coherency across slices. Our approach relies on a multiscale generator with carefully injected procedural frequency content, which allows interactive exploration of the emerging continuous scale space. Training across vastly different scales poses challenges regarding stability, which we tackle using a supervision scheme that involves careful sampling of scales. We show that our generator can be used as a multiscale generative model, and for reconstructions of scale spaces from unstructured patches. Significantly outperforming the state of the art, we demonstrate zoom-in factors of up to 256x at high quality and scale consistency.
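The abstract does not say how scales are sampled during training. Purely as an illustration, the sketch below draws zoom factors log-uniformly between 1x and the 256x mentioned above, so that each octave of scale is equally likely, and maps a zoom factor to a crop window in a unit square. The functions `sample_zoom` and `crop_window` and the log-uniform choice are assumptions for this example, not the authors' supervision scheme.

```python
import math
import random


def sample_zoom(rng, max_zoom=256.0):
    """Draw a zoom factor in [1, max_zoom], uniform in log2 space (assumed scheme)."""
    return 2.0 ** rng.uniform(0.0, math.log2(max_zoom))


def crop_window(center, zoom, full_extent=1.0):
    """Square window of side full_extent / zoom around `center`, clamped to the domain.

    Returns (x_min, y_min, x_max, y_max) in the coordinates of the full extent.
    """
    half = full_extent / zoom / 2.0
    # Shift the center inward if the window would cross the domain boundary.
    cx = min(max(center[0], half), full_extent - half)
    cy = min(max(center[1], half), full_extent - half)
    return (cx - half, cy - half, cx + half, cy + half)
```

A training loop built on this sketch would call `sample_zoom` with a seeded `random.Random` instance and then `crop_window` to pick the region the generator is asked to render at that scale.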

acm trans, informatik, scale space, (15 more...)

arXiv.org Artificial Intelligence

2406.08924

Country:

Europe > Germany > Saarland > Saarbrücken (0.05)
Europe > France > Provence-Alpes-Côte d'Azur (0.05)
Europe > Spain (0.04)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
